Subtype and cell type specific expression of lncRNAs provide insight into breast cancer

Long non-coding RNAs (lncRNAs) are involved in breast cancer pathogenesis through chromatin remodeling, transcriptional and post-transcriptional gene regulation. We report robust associations between lncRNA expression and breast cancer clinicopathological features in two population-based cohorts: SCAN-B and TCGA. Using co-expression analysis of lncRNAs with protein coding genes, we discovered three distinct clusters of lncRNAs. In silico cell type deconvolution coupled with single-cell RNA-seq analyses revealed that these three clusters were driven by cell type specific expression of lncRNAs. In one cluster lncRNAs were expressed by cancer cells and were mostly associated with the estrogen signaling pathways. In the two other clusters, lncRNAs were expressed either by immune cells or fibroblasts of the tumor microenvironment. To further investigate the cis-regulatory regions driving lncRNA expression in breast cancer, we identified subtype-specific transcription factor (TF) occupancy at lncRNA promoters. We also integrated lncRNA expression with DNA methylation data to identify long-range regulatory regions for lncRNA which were validated using ChiA-Pet-Pol2 loops. lncRNAs play an important role in shaping the gene regulatory landscape in breast cancer. We provide a detailed subtype and cell type-specific expression of lncRNA, which improves the understanding of underlying transcriptional regulation in breast cancer.

T ranscriptional programs shape cancer cell phenotypes. In breast cancer, clinically relevant subtypes have been identified based on gene expression. Luminal A, Luminal B, Her2-enriched, Basal-like, and Normal-like subtypes have different natural histories, prognosis, and responses to therapies. These subtypes can be identified based on the expression of 50 genes (PAM50) 1 .
Luminal A/B tumors are typically estrogen receptor (ER) positive, with Luminal B having a higher expression of proliferationrelated genes. Her2-enriched tumors overexpress genes belonging to the ERBB2 pathway, while Basal-like tumors are usually negative for both ER and Her2, and for the progesterone receptor, and to a high degree reflect the triple-negative subgroup 2 .
We have previously shown that transcriptional programs between these subtypes are different and underlie their classification 3 . However, phenotypic heterogeneity within each subgroup pertains and could help to further refine subtyping and individualized treatment options.
Long non-coding RNA (lncRNA) expression is highly cell type and tissue specific 4,5 . lncRNAs play important roles in gene regulation, both at the transcriptional and posttranscriptional levels and may help shaping cell type and tissue phenotypes. Several studies have shown that genomic location of lncRNA overlap with enhancer regions 6,7 , and that lncRNA promoters may contain chromatin marks associated both with active promoters and enhancers 8 .
A substantial number of lncRNAs are tethered to chromatin at or near transcription start sites and can modulate transcription in cis through the recruitment of transcription factors (TF) and chromatin modifiers 9,10 . One of the possible effects of lncRNAs in regulating other genes' transcription is through the modulation of enhancer activity and recruitment of proteins that establish and stabilize chromatin conformation 11,12 . In breast cancer, lncRNAs have been implicated in tumor progression, resistance to treatment 13 and in activating the transcriptional network leading to metastasis 14 . Subtype-specific lncRNA expression has previously been described, particularly in the The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) cohort, however, with limited statistical power and validation 13,15,16 . In addition, tumor-specific epigenetic alterations have been identified at lncRNA promoters 15 , and lncRNA function has been assessed through pathway enrichment of neighboring genes 16 . Yet the function and regulation of the majority of lncRNAs in breast cancer pathogenesis remains unknown, especially in a subtype-specific manner.
In this study we identify the robust association of lncRNA expression to clinicopathological features in two large populationbased cohorts: the Sweden Cancerome Analysis Network -Breast (SCAN-B) initiative and the TCGA-BRCA. Using co-expression analysis of lncRNAs with protein-coding genes, we reveal the cell type-specific expression of lncRNA in breast tumors. To further understand the regulatory network driving lncRNA expression in breast cancer, we combine the clinical and genomic annotation of lncRNA with epigenetic data, transcription factor binding sites, and long-range chromatin interaction information.

Results
lncRNA expression according to breast cancer clinicopathological subtypes. To identify lncRNAs expressed by specific breast cancer subtypes or associated with clinicopathological features, we analyzed RNA-sequencing data from two large independent breast cancer cohorts: SCAN-B (n = 3455) 17 and TCGA-BRCA (n = 1095).
We focused on lncRNAs annotated in the Ensembl 18 v93 noncoding reference transcriptome ( Supplementary Fig. 1), and identified 4108 lncRNAs present in both cohorts, which are further analyzed in this study. A small number of lncRNAs (100 in SCAN-B, 37 in TCGA) were expressed >1 transcript per million (TPM) in all patients, but the majority of lncRNAs were expressed at lower levels. The two cohorts differ in the distribution of patients expressing lncRNAs>1TPM (Supplementary Fig. 2). Such sparsity of the lncRNA expression in the two cohorts highlights the importance of analyzing at least two independent breast cancer cohorts to robustly identify the lncRNA associated with clinicopathological features. Hierarchical clustering of the log2 expression values of the 4108 lncRNAs clearly grouped ER positive from ER negative patients, (Fig. 1a: SCAN-B and Fig. 1b: TCGA), indicating an association between breast cancer subtypes and lncRNA expression.
To further identify the lncRNAs associated with breast cancer subtypes, we performed differential expression analysis using linear modeling and empirical Bayes moderation. We report lncRNAs with significant differential expression according to the ER status (Fig. 1c) and HER2 status (Fig. 1d). The significant Pearson correlation between the log fold change (FC) in the SCAN-B and the TCGA cohorts (r = 0.93, p-value < 2.2e-16: ER status and r = 0.75, p-value < 2.2e-16: HER2 status) show that we identify with high confidence lncRNA differentially expressed according to pathological breast cancer subtypes.
On each plot (Fig. 1c, d), we highlight the lncRNAs with the highest absolute fold changes in each breast cancer subgroup. Detailed results from the differential expression analysis are available in Supplementary Data 1. FOXCUT was the most significantly deregulated lncRNA over-expressed in ER negative tumors with the highest fold change in both SCAN-B and TCGA, it has been previously shown to enhance proliferation and migration in ER negative breast cancer cell lines 19 .
We further performed all pairwise differential expression analyses within the five molecular PAM50 subtypes, Luminal A, Luminal B, Her2-enriched, Basal-like and Normal-like. Figure 1e shows the results of such analysis for Luminal A versus Luminal B, two subtypes considered to be closely related, as they are both ER positive; however, we still report 1448 differentially expressed lncRNA between these two subtypes. All pairwise comparisons considering PAM50 subtypes are presented in Supplementary  Fig. 3 and Supplementary Data 1.
Few lncRNAs have been associated to patient outcome 20 . To assess the relevance of lncRNA expression robustly and systematically with regards to breast cancer prognosis, we performed Cox proportional hazards regression analysis in the SCAN-B cohort in ER + and ER patients separately. 305 lncRNAs were significantly associated to overall survival of ER + patients in the SCAN-B cohort, of which MAPT-AS1, AP000851.1, AP000851.2, and ROCR could be validated in TCGA-BRCA (Supplementary Data 2). MAPT-AS1 has been previously shown to be associated with better patient outcome in breast cancer patients 21 . ROCR, the lncRNA with highest expression in the Luminal A subtype was also associated with ER + prognosis. 77 lncRNAs were associated to overall survival within the ER-group in the SCAN-B cohort, however, none of these were significantly associated with survival after multiple testing correction in the TCGA-BRCA cohort.
To our knowledge, this initial analysis is the first to robustly identify differentially expressed lncRNAs according to breast cancer clinicopathological features and molecular subtypes in two large and independent cohorts.
Clustering lncRNAs according to high degree of co-expression with protein coding mRNAs. To associate lncRNA expression to known biological functions, we used a co-expression approach (Supplementary Fig. 4a) between lncRNA (n = 4108) and protein coding genes´mRNA (n = 17060). Retaining the significant Spearman correlation coefficients of all lncRNA-mRNA associations in both cohorts (Bonferroni corrected p-value < 0.05), led to n = 15407856 significant correlations. On average, each lncRNA was significantly correlated with the expression of 95 mRNAs ( Supplementary Fig. 4b), while each mRNA was on average correlated with 20 lncRNAs (Supplementary Fig. 4c). Among the lncRNAs associated to the expression of the highest number of mRNAs, we found a non-coding RNA activated by DNA damage (NORAD), known to regulate genome stability 22 , as well as other lncRNAs with known function in DNA-damage response, including the estrogen responsive LINC01488 23 .
We then performed unsupervised clustering of the significant correlations with an absolute Spearman Rho >0.4 and involving lncRNAs and mRNAs with more than the average number of significant correlations ( Supplementary Fig. 4b, c). All significant correlations fulfilling these criteria are denoted in Supplementary Data 3. We identified three lncRNA clusters (x-axis) which correlated with three mRNA clusters (y-axis) (Fig. 2a). Interestingly, most of the correlation coefficients (99.8%) were positive, showing more positive correlations between mRNA and lncRNA than expected by chance. To assess whether the discovery of our three biclusters was driven by the filtering criteria used to select lncRNA and mRNA, an unsupervised clustering including all lncRNAs and mRNAs allowed the rediscovery of the three biclusters, however with much more sparsity ( Supplementary  Fig. 5).
To link the clustered lncRNAs to the differential expression analysis performed according to breast cancer subtypes, we  Her2 status, as well as PAM50 subtypes are annotated at the top of the heatmap. The expression gradient (blue to red) represents scaled and centered log2(TPM + 1). c-e Dot plot of the log Fold Change (FC) from the differential expression analysis using a fitted Limma model (lmfit) and moderated t-statistic (eBayes) between patients of different subtypes in SCAN-B (xaxis) and TCGA-BRCA (y-axis). Each dot represents a lncRNA, while the colour indicates the subtype with the highest expression c ER positive (blue) and ER negative (red), d Her2 negative (dark blue) and Her2 positive (pink). annotated lncRNAs according to whether they were found overexpressed in the respective groups compared in Fig. 1c-e (column annotations of the heatmap). We observed that lncRNAcluster 1 and 3 were populated by lncRNAs overexpressed in ER positive and ER negative cases, respectively.
Grouping lncRNAs into pathways related to breast cancer pathogenesis. Following the unsupervised clustering (Fig. 2a), we found a high degree of significant and dominantly positive correlations between the (i) lncRNAs in cluster 1 and the mRNAs in cluster A, (ii) lncRNA-cluster 2 and mRNA-cluster B and   (iii) lncRNA-cluster 3 and mRNA-cluster C. By performing Gene Set Enrichment Analysis (GSEA) using the genes of each mRNAcluster as input, we could infer by association the pathways the lncRNA-clusters may functionally be associated with.
lncRNA-cluster 1 & mRNA-cluster A -Estrogen signaling cluster. 91% of the lncRNAs in cluster 1 were significantly overexpressed in ER positive cases, when compared to ER negative, associating these lncRNAs with estrogen signaling. Further, we found that GSEA analysis of genes in mRNA-cluster A were significantly associated with the estrogen signaling pathway (Fig. 2b, Supplementary Data 3).
lncRNA cluster 2 & mRNA-cluster B -Fibroblast cluster. 52% of the lncRNAs in cluster 2 were significantly overexpressed in ER positive and 21% in ER negative cases. Interestingly, 87% were overexpressed in Luminal A tumors when compared to Luminal B. According to GSEA, genes of mRNA-cluster B are involved in Epithelial to Mesenchymal Transition (EMT) and Apical junctions (Fig. 2c, Supplementary Data 3). There is a high similarity between mesenchymal cells and fibroblasts 24 , and fibroblasts are strongly associated with shaping of the extra cellular matrix 25 . In addition, fibroblasts have been shown to be highly abundant in Luminal A breast tumors 26 . We therefore hypothesized that lncRNAs from cluster 2 may be expressed by fibroblasts of the tumor microenvironment.
lncRNA cluster 3 & mRNA-cluster C -Immune cluster. For the third lncRNA-cluster, 61% of the lncRNA were overexpressed in ER negative tumors. Protein coding genes of mRNA cluster C were highly correlated with lncRNA-cluster 3 and enriched among pathways reflecting activation of the immune system (Fig. 2d, Supplementary Data 3). Given the fact that ER negative tumors have significantly higher immune infiltration than ER positive tumors 27 , we hypothesized that lncRNAs from cluster 3 may be expressed by immune cells and / or modulate the immune tumor microenvironment.
Predicting cell type expression of lncRNAs. Having set hypotheses on which pathways and cell types clustered-lncRNA may be associated with, we aimed at providing further evidence for the cell type specific expression of lncRNAs using different approaches.
First, we modeled lncRNA expression as a multivariate function of ESR1 mRNA expression, fibroblast and lymphocyte infiltration scores reflecting fibroblast or lymphocyte tumor content. We tested which of the three variables explained best each lncRNA's expression (Supplementary Data 4). To infer the lymphocyte and fibroblast content and calculate lymphocyte and fibroblast scores, we used bulk gene expression and the Nanodissect [23] or xCell 28 algorithms, respectively (see Methods).
We found that ESR1 expression, fibroblast score, and lymphocyte score were the most significant explanatory variables for 82% of lncRNAs in cluster 1, 60% of lncRNAs in cluster 2 and 84.5% of lncRNA in cluster 3, respectively. Furthermore, when comparing the logistic regression coefficients, which reflect how much each variable explains lncRNA expression, we found that in average the ESR1-coefficients were significantly higher in cluster 1 ( These detailed analyses clearly divide lncRNA expressed in breast cancer in three categories, they are either expressed by cancer cells and belong to the estrogen signaling pathways or they are expressed by the main cell types of the tumor microenvironment: lymphocytes and fibroblasts. Expression of lncRNAs in breast cancer cell lines and single cell RNA-seq data. To clearly associate lncRNAs with cell type specific expression in breast cancer, we investigated lncRNA expression in a panel of breast cancer cell lines 29 . Differential expression analysis of breast cancer cell lines according to molecular subtypes confirmed that the lncRNAs with significantly high expression in Luminal A and Luminal B (ER + ) cell lines belonged to cluster 1. The top three lncRNAs for both Luminal subtypes are shown in Supplementary Fig. 7a, b. Overall, cluster 1 lncRNAs were expressed at higher levels in the Luminal cell lines ( Supplementary Fig. 7c, d). Cluster 2 lncRNAs, which we identify as mainly being expressed in fibroblasts of the tumor microenvironment, showed highest expression in cell lines of the Normal-like subtype. In cluster 3, 20% of the lncRNAs were not expressed in any breast cancer cell lines, but the remaining cluster 3 lncRNAs had the highest expression in Basal, Claudin-low, and Normal-like, ER-cell lines ( Supplementary Fig. 7e-h). All the lncRNAs that significantly defined each subtype cell lines from the rest are included in Supplementary Data 5.
We next interrogated single cell RNA sequencing data from a study of 26 breast cancer patients 30 . Following dimensionality reduction and clustering of the 94357 cells from the study by Wu et al., we observed that the cluster of cells obtained overlapped perfectly with the cell type annotation published by the authors 30 , which included nine main cell types: normal epithelial, cancer epithelial, myeloid, T-cells, B-cells, endothelial cells, plasmablasts, Cancer Associated Fibroblasts (CAFs) and perivascular-like (PVL)-fibroblasts (Fig. 3a). Clustering of lncRNA into relevant pathways for breast cancer. a Hierarchical clustering of lncRNA-mRNA Spearman correlation values (positive correlation in red, negative correlation in blue) following co-expression analysis between lncRNAs (n = 4108) and protein coding mRNAs (n = 17060). Only lncRNA and mRNA with significant correlation (Bonferroni p-value < 0.05) and −0.4> Spearman's rho > 0.4 in the TCGA (n = 1095) and SCAN-B (n = 3455) cohorts are used in the unsupervised clustering. In addition, we plot only lncRNAs and mRNAs with number of association higher than the mean value of association ( Supplementary Fig. 4). Clusters are defined using cutree_rows = 3 and cutree_cols = 3. lncRNAs (x-axis) are annotated according to the differential expression analysis (Fig. 1). b, d Bar plot showing -log(FDR q.value) from a hypergeometric test (y-axis) of gene set enrichment analysis using Hallmark pathways of the MSigDB database. Input genes for GSEA are genes from mRNA-cluster A (n = 2890) (b), mRNA-cluster B (n = 1480) (c), and mRNA-cluster C (n = 667)(d). Boxplot of the coefficients from the generalized linear modeling of the expression of lncRNAs in the SCAN-B cohort using three variables into the same model, ESR1 mRNA (to reflect estrogen signaling (e)), fibroblast score (to infer fibroblast tumor content (f)) and lymphocyte score (to infer lymphocyte infiltration (g)). Each dot represents the coefficient for a variable and each lncRNA in cluster 1 (n = 610), cluster 2 (n = 199), and cluster 3 (n = 110). Kruskal-Wallis test p-values are shown. The line within each box represents the median. Upper and lower edges of each box represent 75th and 25th percentile, respectively. The whiskers represent the lowest datum still within [1.5 × (75th − 25th percentile)] of the lower quartile, and the highest datum still within [1.5 × (75th − 25th percentile)] of the upper quartile.
Dot plot analysis which reflects both average expression and percentage of cells expressing lncRNAs was performed for the lncRNAs with the highest logistic regression coefficient associated with each cluster characteristic feature (i.e, ESR1 expression for cluster 1, fibroblast infiltration for cluster 2, and immune infiltration for cluster 3) (Fig. 3b, d). We confirmed that lncRNAs of cluster 1 were expressed at higher levels in cancer epithelial cells, cluster 2-lncRNAs were mainly expressed by cancer associated fibroblasts, while lncRNAs of cluster 3 were expressed by immune cells. We further illustrate the expression of GATA3-AS1 (Fig. 3e), NR2F1-AS1 (Fig. 3f) and LINC00861 (Fig. 3g) on a Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP). LINC00861 has been shown to be expressed in T-Cells in the tissue microenvironment (TME) of lung adenocarcinoma patients and was associated with better prognosis 31 . This lncRNA was also associated with better outcome in ER-patients in the SCAN-B cohort (Supplementary Data 2). Additional illustrations of lncRNA expression on the UMAP are included as Supplementary Fig. 8.
With these analyses, we directly identified lncRNA expression in either breast cancer cells, including cell lines, immune cells, or fibroblast of the breast tumor microenvironment.
Transcriptional regulation of expression at lncRNA promoters. lncRNAs are typically co-expressed with protein coding mRNA neighboring genes 4 . We aimed at characterizing lncRNAs regulatory regions in breast cancer.
To focus only on lncRNA specific regulatory regions and avoid analyzing regulatory regions from protein coding genes, we selected lncRNAs for which the promoter regions (transcription start site (TSS) −200/+100 bp) did not overlap with protein coding genes (Fig. 4a, Supplementary Data 6). Indeed, lncRNAs with promoters overlapping with protein coding genes had a higher level of co-expression with neighboring protein coding genes than independent lncRNAs and the nearest protein coding mRNA ( Supplementary Fig. 9). We therefore further analyzed the promoters of lncRNA with no overlap with protein coding gene loci; either promoters of lncRNAs overexpressed in ER positive (n = 2320) or ER negative (n = 536) samples. We compared these two groups of promoters with respect to i) Chromatin accessibility measured by ATAC-seq in 74 TCGA-BRCA patients, ii) ChromHMM, chromatin genome segmentation, and iii) Transcription Factor (TF) -binding sites using the UniBind database 32 .
lncRNA promoters are accessible in an ER-status specific manner. We found lncRNA promoters to be accessible in a lineage specific manner, i.e. promoters of lncRNA overexpressed in ER positive tumors were more open (higher Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) signal) in ER positive samples than in ER negative samples. Similarly, promoters of lncRNAs over-expressed in ER negative tumors showed significantly higher ATAC-seq signal in ER negative samples (Fig. 4b, c, Supplementary Data 6), suggesting that lncRNA promoters are highly regulated in a subtype specific manner.
lncRNA promoters are defined as active regions according to chromHMM. We assessed whether lncRNA promoters were enriched for specific chromHMM regions defined in subtype specific breast cancer cell lines. We mainly observed significant enrichment for 'Promoter Flanking' and 'Enhancer' (Fig. 4d, e, Supplementary Data 6). When expanding the window upstream of the TSS, the enrichment for 'Enhancer' marks became even more significant, with the lncRNAs over-expressed in ER negative tumors showing particularly significant overlap with 'Enhancer' marks in Basal like cell lines ( Supplementary Fig. 10). Specific transcription factors binding sites are found at lncRNA promoters. We next sought for enrichment in transcription factor binding sites (TFBS) in lncRNAs promoters using the UniBind database 32 Fig. 11a-c). When extending the window upstream of the TSS for ER-lncRNA there was also enrichment for CEBPB, a transcription factor involved in inflammatory response 34 , and several additional AP-1 family members with known function in dendritic cell identity 35 ( Supplementary Fig. 11d-f). Altogether, these results gave insight into the regulatory programs specifically at lncRNA promoters and showed that this regulation is closely related to estrogen receptor status in breast cancer.
Identifying distal regulatory regions for lncRNA. Finally, we sought for distal regulatory regions for lncRNA in breast cancer. We used our previously published method 36 , which is efficient at identifying distal enhancer and long-range interactions between enhancers and promoters through negative correlations between DNA methylation and transcript expression. We correlated the levels of DNA methylation at CpGs and lncRNA expression for all CpGs and lncRNAs on the same chromosome in two cohorts for which DNA methylation and lncRNA expression were available TCGA-BRCA (n = 603) and OSLO2 (n = 279). As the OSLO2 lncRNA expression was measured by Agilent microarray 60 K, we focused on 1027 lncRNAs found in both cohorts ( Supplementary Fig. 12). For both cohorts, we identified 26342 CpGs significantly inversely correlated with 396 lncRNA (Bonferroni corrected Spearman correlation p-value < 0.05). We first tested in which chromHMM regions the CpGs whose DNA methylation was inversely correlated with lncRNAs were located and found them significantly enriched in enhancer regions (Fig. 5a, Supplementary Data 7). CpGs negatively correlated with lncRNAs highly expressed in ER positive tumors were found in open chromatin regions significantly more open in ER positive samples according to the TCGA-BRCA ATAC-seq data (Fig. 5b). Correspondingly, CpGs negatively correlated with lncRNAs highly expressed in ER negative breast cancer were found in regions significantly more open in ER negative tumors (Fig. 5c). Further confirming that the CpG in cis inverse correlation with lncRNA expression pointed at biologically relevant and active distal regulatory regions, we found such CpGs near binding sites of TFs described at breast cancer enhancers (Fig. 5d).
The LINC01488 locus provides a good illustration of distal regulatory regions possibly involved in the regulation of lncRNA expression (Fig. 5e). LINC01488 expression showed negative correlation to distant CpGs on the same chromosome in the TCGA-BRCA and OSLO2 cohorts (Fig. 5e). A specific negative correlation between LINC01488 expression and DNA methylation levels at a CpG (cg00211115) in an upstream active enhancer region is shown in Fig. 5f (OSLO2) and Fig. 5g (TCGA). This CpG has lower levels of methylation in ER positive patients and was found to reside within the binding sites of key transcription factors (ESR1, FOXA1, and GATA3, ChIP-seq). Furthermore, experimental long-range interactions defined by Pol2 binding (ChIA-PET Pol2 data), showed an interaction, loop, between the distal enhancer and LINC01488 TSS (Fig. 5e). LINC01488 was also detected in a long-range interaction with CCND1 (Fig. 5h) and showed significant correlation to CCND1 expression in both SCAN-B (Fig. 5i) and the TCGA-BRCA cohort ( Supplementary  Fig. 13). Other examples of lncRNAs with inverse correlation with DNA CpG methylation at enhancer sites that reside in longrange interactions are shown in Supplementary Figs. 13 and 14 and Supplementary Data 7, or lncRNAs in long-range interactions with protein coding mRNAs (Supplementary Data 7).
Altogether, these analyses show that integration of lncRNA expression with DNA methylation and long-range interaction data aids in identifying subtype-specific distal regulatory regions for lncRNA.

Discussion
This study is, to our knowledge, the first to identify lncRNAs associated to clinicopathological features in breast cancer using two large independent cohorts. Combining the analysis of the SCAN-B and TCGA-BRCA RNA-seq data allowed us to assess the expression of more than 4000 lncRNAs with respect to breast cancer clinicopathological features and to report lncRNAs with robust association to clinical features across patient cohorts with remarkable concordance. We identified more than 2800 lncRNA genes, almost 70% of all the lncRNAs included in this analysis, with significant differential expression between ER positive and ER negative breast cancer. This is in line with the previous observations based on the TCGA-BRCA cohort alone 13 .
Characterization of lncRNA functions remains a critical challenge 37 . Here, to approach this question from an in silico point of view, we grouped lncRNAs based on their correlation with all protein coding mRNAs using hierarchical clustering. This expands on previous studies that have focused on selected lncRNAs or identified pathway enrichment of neighboring mRNAs 15,16 . Our analysis revealed how lncRNA expression is related to underlying features of inter-and intra-tumor heterogeneity. While lncRNAs of cluster 1 were mainly over-expressed in ER positive breast cancers and were found to be associated with estrogen signaling, the expression of lncRNAs of cluster 2 and 3 were mainly explained by cells from the tumor microenvironment. This further underlines the highly cell-and tissuespecific expression of lncRNAs 4,5 .
Cluster 2 -lncRNAs had their expression mostly explained by an in silico computed fibroblast score. We further verified this Fig. 4 Functional annotation of lncRNA promoters. a Schematic overview of the definition of lncRNA promoters not overlapping with a protein coding gene locus. bp: base pair; PC: protein-coding; TSS: transcription start site. b, c Average normalized counts for ATAC-seq peaks mapped to lncRNA promoters in estrogen receptor (ER) positive (+) (blue dots) (n = 58) and ER negative (−) (red dots) (n = 12) breast tumor samples from the TCGA-BRCA cohort. Wilcoxon test p-values are denoted. The line within each box represents the median. Upper and lower edges of each box represent 75th and 25th percentile, respectively. The whiskers represent the lowest datum still within [1.5 × (75th − 25th percentile)] of the lower quartile, and the highest datum still within [1.5 × (75th − 25th percentile)] of the upper quartile. b Promoters of independent lncRNAs overexpressed in ER positive cases and c promoters of independent lncRNAs overexpressed in ER negative cases. d, e Enrichment of independent lncRNA promoters across ChromHMM genome segmentation from breast cancer cell lines. Enrichment is calculated as the ratio between the frequency of lncRNA promoters found within a specific segment type, over the frequency of all lncRNA promoters within the same segment type. The length of the bars (x-axis) shows the log transformed BH corrected p-value from the hypergeometric test. d Promoters of independent lncRNAs overexpressed in ER positive cases and e promoters of independent lncRNAs overexpressed in ER negative cases. Active Enhancer=EhAct, Active Promoter = PrAct, Repeat Zink Finger = RpZNF, Flanking Promoter region = PrFlk. f, g Swarm plots showing enrichment of TF binding sites (-(log10(p-value) using Fisher's exact tests) on the y-axis for specific sets of promoters according to UniBind. TF names of the top 10 enriched TF binding sites data sets are annotated by colours. f Promoters of independent lncRNA overexpressed in ER positive cases and g promoters of independent lncRNAs overexpressed in ER negative cases. observation using single cell RNA-seq data and confirmed that many cluster 2 -lncRNAs were expressed by fibroblasts. One such lncRNA was MEG3, which was shown to contribute to the development of cardiac fibrosis 38 . Further, the expression of the mature miRNAs hsa-miR-99a and hsa-miR-100 have previously been associated to fibroblasts in breast cancer 39 . We found that the corresponding miRNA precursor transcripts which themselves are lncRNAs were part of cluster 2 and were indeed expressed in fibroblasts of the tumor breast microenvironment. Pathway enrichment of the cluster 2 associated mRNAs showed association to EMT. Expression of a cluster 2 lncRNA, ROCR, has been reported to regulate SOX9 expression in both mesenchymal stem cells 40 , and basal-like breast cancer cells, where it promoted proliferation 41 . In this study we identify ROCR as the lncRNA that most significantly differentiates Luminal A and Luminal B breast cancer patients. Interestingly, NR2F1-AS1 has recently been shown to be up-regulated in mesenchymal-like breast cancer stem-like cells, contributing to tumor dissemination 42  invasion and metastasis 43 , and further studies are needed to establish whether other lncRNAs from cluster 2 directly contribute to invasiveness in breast cancer. Gene set enrichment terms for mRNAs associated to cluster 3 lncRNAs point to hot tumors with high immune infiltration. We were able to identify several lncRNAs from this cluster which were expressed by tumor infiltrating immune cells. A recent pancancer study of patients in the TCGA cohort identified a panel of immune-related lncRNAs [34] which could stratify non-small cell lung cancer in three subgroups with differences in response to chemotherapy, and prognosis. Cluster 3 lncRNAs were identified as regulators of immune-related pathways in 44 and had higher expression in ER negative patients. Knowledge about the specific cell types that express lncRNAs can improve our understanding of their function in cancer. We believe our identification of immune-related and fibroblast-associated lncRNAs can serve as a useful resource to choose relevant model systems for more indepth functional characterization of lncRNAs.
To identify transcriptional regulation of lncRNAs, independent of the regulation of the neighbor protein coding gene, we first separated lncRNA based on whether their promoters overlap with the protein coding gene loci or not. lncRNA-mRNA pairs where the lncRNA promoters were located within the protein coding gene locus showed significantly higher correlation than other lncRNA-mRNA pairs. This has been reported previously in AML patients and cell-lines and indicates a shared cis-regulation between lncRNAs and protein coding gene at the same locus 45 .
Higher enhancer activity has been attributed to lncRNA transcription 7 , and tissue-specific expression of lncRNAs at enhancer regions suggests a role in determining lineage-specific gene expression [4]. We found an enrichment of chromatin features associated with active enhancer regions in lncRNA promoters, which may further indicate that these lncRNAs originate from subtype specific regulatory elements that are active in cancer cells.
The most significant enrichment at lncRNA promoters with high expression in ER-patients was for repeat sequences/ZNF gene clusters. Repeat and transposable elements play a role in both the origin, and regulation of lncRNAs [40], and we cannot rule out transcription in these areas due to hypo-methylation/derepression of otherwise silenced genomic regions.
The transcription factors FOXA1 and ESR1 bind to active enhancers in breast cancer 46,47 , and are important for lineage determination 36 . We found enrichment for FOXA1 and ESR1 binding sites at the independent promoters of lncRNAs with high expression in ER positive patients, which provides further evidence for an association between the expression of some of these lncRNAs to enhancer function. lncRNAs with enhancer functions can regulate nearby protein coding genes [36]. LINC01488 has been shown to mediate breast cancer risk by playing a role in homologous recombination (HR)mediated DNA repair. The risk SNP resides in a distant enhancer of CCND1, which is also involved in estrogen induction of LINC01488 expression 23 . Here, we identify several distal enhancer regions in long-range interaction with the TSS of LINC01488. We show lower levels of DNA methylation at these enhancers in ER positive patients. The lncRNA is also in long-range interaction with the neighboring gene, CCND1. LINC01488 shares a bivalent promoter with AP000439.2 This lncRNA was not detected in the OSLO2 cohort (measured by microarray), and it is possible that the same distal regulatory regions are involved in regulation of both these neighbor lncRNAs.
A significant correlation between LINC01488 and CCND1 expression was observed in both the SCAN-B and TCGA-BRCA cohorts. In the study by Betts et al. knockdown of LINC01488 resulted in decreased expression of CCND1. Further studies are necessary to determine the role of LINC01488 on CCND1 expression, and to identify other enhancer lncRNAs that may function in gene regulation of protein coding genes in breast cancer subtypes.
In conclusion, we find a large number of lncRNAs with specific expression related to clinicopathological features in breast cancer. In breast cancer lncRNA expression associate to specific pathways known to play a role in pathogenesis, as well as specific cell types infiltrating breast tumors. We show that promoters of lncRNAs are enriched in regulatory regions and TF relevant to breast cancer, indicating active transcriptional regulation and association to lineage specific enhancers in breast cancer subtypes.

Methods
Patient material. Two independent breast cancer cohorts with RNA-seq data were used; SCAN-B (n = 3455) 48 and The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) cohort (n = 1095) 49 . A third independent cohort, the OSLO2 breast cancer cohort for which lncRNA expression were measured by Agilent 60 K array 50,51 , was also included. Transcript abundance from kallisto were summarized to gene level expression using tximport 53 (v1.16.1) in R. lncRNAs were defined as genes in the Ensembl (v93) non-coding reference with a length above 200 bp. lncRNA expressed at >1 TPM in >5% of samples in the cohort (Supplementary Fig. 1), with an interquartile range >0.1 (IQR function in R) were included in the downstream analysis (Supplementary Data 9). Hierarchical clustering of patients was performed using hclust as part of the pheatmap package (v1.0.12) in R with correlation distance and ward D2 as agglomeration method (Fig. 1a, b).
The Oslo2 breast cancer cohort. The Oslo2 breast cancer cohort has been previously described 39,50,51 and is a consecutive study collecting material from breast cancer patients with primary operable disease at several hospitals in south-eastern Norway. Patients were included in the years 2006-2019. The study was approved by the Norwegian Regional Committee for Medical Research Ethics (approval number 1.2006.1607, amendment 1.2007.1125), and patients have given written informed consent for the use of material for research purposes. All experimental methods performed are in compliance with the Helsinki Declaration. The mRNA expression data have been previously published and can be obtained from GEO with accession number GSE58215 50 . To accurately assign array probes to lncRNAs, published probe sequences (GEO Platform GPL14550) were aligned to Ensembl.93 noncoding reference sequences (Homo_sapiens.GRCh38.ncrna.fa) using blast (ncbiblast-2.6.0). Probes where all 60 bp matched 100% to the reference were included in the analysis. In the case where several probes could detect the same lncRNA, the mean expression value was used. A total of 4018 probes mapping to 3000 unique Ensembl gene IDs were included in the lncRNA analysis, and 1027 lncRNAs were detected in all three cohorts according to the filtering criteria described for TCGA and SCAN-B.
The Illumina Infinium HumanMethylation450 microarray was used to measure the DNA methylation levels (GSE84207) 57,58 . Preprocessing and normalization involved steps of probe filtering, color bias correction, background subtraction and subset quantile normalization. The DNA methylation data have been previously published 36 .
Differential gene expression analysis. "scaledTPM" values from the tximport function were used to create a DGEList object using edgeR ( Correlation to protein coding genes expression and hierarchical clustering of lncRNAs. Log2(TPM + 1) expression values for lncRNAs were correlated to all protein coding genes with an interquartile range >0.1 (IQR function in R) in the TCGA and SCAN-B cohorts, using Spearman correlation (cor.test in R). lncRNA-mRNA pairs with p-value < 0.05 after Bonferroni correction 59 in both cohorts were included in the subsequent analysis (Supplementary Data 9 and 10). lncRNAs and mRNAs were filtered prior to clustering, retaining only those with i) Spearman Correlation coefficient below −0.4 and above 0.4 in both cohorts, and ii) more than the average number of associations (n = 95 lncRNAs, n = 20 mRNAs, Supplementary Fig. 3b, c). For clustering, Spearman correlation values were binarized to −1/1 for negative and positive correlation respectively. Hierarchical clustering was performed using hclust as part of the pheatmap package (v1.0.12) in R with correlation distance and average linkage. To identify and decide upon the number of lncRNA and mRNA clusters, the dendrograms were visually inspected using different cut-offs on the cutree_rows and cutree_cols functions of the pheatmap package. Cut-offs were manually selected to define the clusters depicted in Fig. 2a (cutree_rows = 3 and cutree_cols = 3).
Gene Set Enrichment analysis (GSEA). Gene Set Enrichment Analysis was carried out using either the 50 Hallmark pathway gene sets 60 (h.all.v7.0.symbols.gmt), or "C5", ontology gene sets 61,62 (c5.all.v7.0.symbols.gmt) from MSigDB 63 . Enrichment was calculated using hypergeometric test (the R function phyper) of the mRNAs in each cluster, against all genes in a gene set. P-values were FDR corrected, and the top 10 pathways with adjusted p-value < 0.05 were used.
Lymphocyte and fibroblast infiltration scores. The Nanodissect algorithm 64 (http://nano.princeton.edu/) was used for in silico estimation of lymphocyte infiltration. The breast collection data (May 2013), which contains 17940 genes measured on 622 arrays, was inspected for genes specifically expressed in lymphocytes (standard genes; n = 476; available online and defined from expert literature review) and not expressed in mammary gland (n = 777) or mammary epithelium (n = 79). The genes with more than 65% probability to be positive lymphocyte-specific standard genes as opposed to mammary gland or epithelium were further used in downstream analysis to score each SCAN-B and TCGA-BRCA samples for the level of lymphocyte infiltration. The average expression of the set of standard genes in a sample reflected lymphocyte infiltration. The xCell algorithm 28 was used to obtain a fibroblast score for SCAN-B samples with log2 (TPM + 1) values as input. For TCGA, xCell scores were downloaded from https://xcell.ucsf. edu/xCell_TCGA_RSEM.txt.
lncRNA expression modeled with generalized linear models. Generalized linear modeling (glm function in R) was used to model lncRNA expression as a function of ESR1 mRNA expression, fibroblast infiltration, and lymphocyte infiltration to estimate which variable(s) explained most each lncRNA expression. Resulting coefficient of such modelling are used in subsequent analysis to estimate the impact of each variable in lncRNA clusters.
RNA-seq from breast cancer cell lines. Gene expression from cell lines representing different breast cancer molecular subtypes: MCF7 and ZR751 (luminal A), MB361 and UACC812 (luminal B), AU565 and SKBR3 (HER2), MB469 and HCC1937 (basal), MB231 and MB436 (Claudin-low), and MCF10A and 76NF2V (Normal breast), each 4 replicates (GSE96860 29 ,) was obtained from the Recount3 project 65 (v 1.2.6) using the recount::getTPM function in R. 911 of the 919 lncRNAs defined in the clustering analysis (Fig. 2a) were available and used to identify differentially expressed lncRNAs in each subtype compared to all other subtypes (wilcox test) using the FindAllMarkers function of the Seurat package (v4.1.0) in R.
Single cell RNA-seq from breast cancer patients. Count matrix of single cell RNA-seq 30 were analyzed using the Seurat package (v3.2.1) in R to obtain UMAP. In brief, count matrix were already filtered for dying cells by the authors. It was further normalized and scaled regressing out potential confounding factors (number of UMIs, number of gene detected in cell. percentage of mitochondrial RNA). After scaling, variably expressed genes were used to construct principal components (PCs) and PCs covering the highest variance in the dataset were selected based on elbow and Jackstraw plots to build the UMAP. Clusters were calculated by the FindClusters function with a resolution between 0.8 and 1.8, and visualized using the UMAP dimensional reduction method.
Nine main cell types were identified on these UMAP based on the authors annotations. The main cell types identified are normal epithelial, cancer epithelial, myeloid, T, B, endothelial cells, plasmablasts, CAF and perivascular-like -fibroblasts.
lncRNA promoter annotation. lncRNA promoters were defined as Transcription Start Site (transcription_start_site), positions obtained from Ensembl (v.93) using BioMart 66 (biomaRt_2.45.6, host = 'http://Jul2018.archive.ensembl.org') −200 bp (upstream of TSS) and +100 bp downstream, and by increasing the upstream window to −300, −500, and −1000. lncRNA transcripts with independent promoters were obtained using bedtools subtract, with the -A flag, of all lncRNA promoters from a background file containing a window spanning 200 bp (300 bp, 500 bp, and 1000 bp for expanding window sizes) upstream of protein coding gene start positions, to gene end position (BEDtools 67 , v2.29.2), remaining transcripts were regarded as overlapping promoters. Overlapping protein coding genes were identified using the bedtools intersect command with the same input as described above, and nearest protein coding gene to independent lncRNAs were identified by bedtools nearest using the default parameters with lncRNA promoter regions (−200/ + 100) and protein coding genes start-stop coordinates.
ATAC-seq data from TCGA-BRCA. Normalized ATAC-seq peak signals (log2((count + 5)PM)−qn) for 74 TCGA breast tumors 68 were downloaded from the Xena browser 54 (https://xenabrowser.net/datapages/). lncRNA promoter positions (−200/ + 100) were intersected with the peak positions using bedtools intersect. To test for differential open regions between ER positive and negative tumors, the average normalized counts of the peaks containing lncRNA promoters were calculated per tumor sample and a Wilcoxon rank-sum test was applied to test for statistical significance using R. lncRNA promoters associated to ER + or ER-tumors were tested separately.
Enrichment of transcription factors binding sites at lncRNA promoters. To assess the enrichment of TFBSs at lncRNA promoters (300 bp window described above), we considered the direct TF-DNA interactions (i.e. TFBSs) stored in the updated version of the UniBind database as of 20.10.2020 32 . These TFBSs were obtained by combining both experimental (through ChIP-seq) and computational (through position weight matrices from JASPAR 69 ) evidence of direct TF-DNA interactions (see ref. 32 for more details). Note that a TF can have multiple sets of TFBSs derived from different ChIP-seq experiments. The enrichment of UniBind TFBS sets in regions surrounding lncRNA promoters against a universe considering all lncRNA promoter regions (window sizes as described above, Ensembl.93) with the UniBind enrichment tool (https://unibind.uio.no/ enrichment/, source code available at https://bitbucket.org/CBGR/unibind_ enrichment/; input R data with TFBS information available on zenodo at https:// doi.org/10.5281/zenodo.4452896). Specifically, the enrichment is computed using the LOLA R package (version 1.12.0) 70 using Fisher's exact tests. Fig. 4 f and g, and Fig. 5d plot the Fisher's exact p-values using swarm plots (swarmplot function of the seaborn Python package, https://doi.org/10.5281/zenodo.824567) with annotations for the TFs associated with top 10 most enriched TFBS sets.
DNA methylation lncRNA expression correlation analysis. Within each data set (OSLO2 and TCGA), CpGs with an interquartile range (IQR) > 0.1 were selected. Considering only CpGs and lncRNAs present in both data sets resulted in 143 631 CpGs and 1027 lncRNAs, and analysis was restricted to lncRNAs and CpGs on the same chromosome (total number of tests n = 7130824). To test the correlation between the level of DNA methylation of CpGs and lncRNA expression (log2 expression (OSLO2) or log2 (TPM + 1) (TCGA)), the Spearman correlation statistics was applied (function cor.test with method = "spearman" in R). An association was considered statistically significant if a Bonferroni-corrected p-value was <0.05. Only significant correlations with the same direction (sign) were kept.
We assessed enrichment of all CpGs with negative correlation to lncRNA expression to each of the 13 chromatin states described above using hypergeometric tests (the R function phyper) with all Illumina Infinium HumanMethylation450 BeadChip CpGs as background (n = 436 506). P-values were corrected using the Benjamini-Hochberg (BH) procedure 59 .
ChIA-PET Pol2 data and ChIP-seq peaks. ChIA-PET Pol2 loop data from the MCF7 cell line was retrieved from ENCODE, accession number ENCSR000CAA 71 . We investigated overlaps between ChIA-PET Pol2 loops and CpGs with negative correlation to lncRNA expression (Supplementary Data 6). A CpG-lncRNA pair was considered to be in a ChIA-PET loop if the CpG and the lncRNA TSS were found in two different feet of the same loop. lncRNA TSS positions were lifted to hg19 coordinates using the UCSC liftOver tool before intersecting with the loop coordinates using BEDtools intersect. Similarly, lncRNA-mRNA pairs were considered to be in a loop if the lncRNA (gene body coordinates) and mRNA (gene body coordinates) were found in two different feet of the same loop. For the specific analyses of MCF7 TF ChIP-seq data sets, we retrieved ENCODE ChIP-seq peak regions from the ReMap 2018 72 database (ENCSR000BST.GATA3.MCF7, ERP000783.ESR1.MCF7, and GSE72249.FOXA1.MCF7).
Statistics and reproducibility. All analyses were performed in the R software were used to define differentially expressed lncRNAs. Benjamini-Hochberg adjusted p-values < 0.05 were considered significant for all tests, unless otherwise stated. Cox proportional hazards regression analysis was performed using the coxph function of the Survival package (v3.3-1) in R with Overall Survival as endpoint. lncRNA (n = 4108)-mRNA (n = 17060) and lncRNA (n = 1027)-CpG (n = 143631) methylation correlation analysis was performed using Spearman correlation (cor.test in R) and Bonferroni corrected for multiple testing. Hypergeometric tests (the R function phyper) were used for GSEA of mRNA clusters (mRNA-cluster A (n = 2890), mRNA-cluster B (n = 1480), and mRNA-cluster C (n = 667)), as well as ChromHMM enrichment analysis. Generalized linear modeling (glm function in R) was used to model lncRNA expression as a function of ESR1 mRNA expression, fibroblast infiltration, and lymphocyte infiltration to estimate which variable(s) explained most each lncRNA expression in SCAN-B (n = 3455) and TCGA (n = 980). Kruskal-Wallis test was used to assess glm coefficients from the three lncRNA Clusters, cluster 1 (n = 610), cluster 2 (n = 199), and cluster 3 (n = 110). Difference in ATAC-seq signal from ER + (n = 58) and ER-(n = 12) breast tumor samples from the TCGA-BRCA cohort was evaluated by Wilcoxon rank-sum test. Fisher's exact tests were used to calculate enrichment of TF binding sites.
Individual statistical tests are described in the relevant sections above and in figure legends.